ABSTRACT


The multivariate analysis techniques of cluster 
analysis, principal components analysis, and discriminant 
analysis are examined in this thesis. The theory and 
applications of each of the techniques are discussed. 
Computer software available at the Naval Postgraduate School 
is discussed and sample jobs are included. 

A hierarchical cluster analysis algorithm, available in 
the IMSL software package, is applied to a set of data 
extracted from a group of subjects for the purpose of 
partitioning a collection of 26 attributes of a weapon 
system into six clusters of superattributes. 

A nonhierarchical clustering procedure, principal
components analysis, and discriminant analysis were all
applied to a collection of data on tanks consisting of
twenty-four observations of ten attributes of tanks. The
cluster analysis shows that the tanks cluster somewhat
naturally by nationality. The principal components analysis
and the discriminant analysis show that tank weight is the
single most important discriminator among nationalities.
I. DISCUSSION OF MULTIVARIATE DATA ANALYSIS


A. INTRODUCTION 

As a set of statistical techniques, multivariate data
analysis is concerned with data collected on several
dimensions of the same observations. The techniques can be
used for many purposes in the behavioral, mathematical, and
administrative sciences, ranging from rigidly controlled
experiments to explain relationships assumed to be present
in a large mass of data to attempts to cluster similar
elements or to find functions of the variables that will
best discriminate among preselected subpopulations of the
observations.

The heart of any multivariate analysis consists of the
data matrix. This matrix is a table that gives the results
of a number of observations on a number of variables
simultaneously (Table I).
TABLE I.  Illustrative Data Matrix

                              Variables
Observations     1       2       3     ...     j     ...     p
     1          x_11    x_12    x_13   ...    x_1j   ...    x_1p
     2          x_21    x_22    x_23   ...    x_2j   ...    x_2p
     .
     i          x_i1    x_i2    x_i3   ...    x_ij   ...    x_ip
     .
     n          x_n1    x_n2    x_n3   ...    x_nj   ...    x_np


The table consists of a set of observations (the n
rows) and a set of measurements on those observations (the
p columns). Cell entries represent the value $x_{ij}$ of
observation i on variable j. The values are
characteristics of the observations and serve to define the
observations in any specific study. The cell values may
consist of nominal, ordinal, interval, or ratio-scaled
measurements, or various combinations of these across columns.
In a general sense "multivariate" analysis would
concern two main features:
1. The multivariate character lies in the
multiplicity of the p variables, not
in the size of the set n.
2. The variables are dependent among
themselves so that we cannot split off one
or more from the others and consider it
by itself. The variables must be
considered together.
There are three characteristics often used as a basis
for the classification of multivariate analysis:
1. whether one's principal focus is on the
objects or on the variables of the data
matrix;
2. whether the data matrix is partitioned
into criterion and independent subsets,
and the number of variables in each;
3. whether the cell values represent
nominal, ordinal, or interval-scale
measurements.
This classification results in four major subdivisions of
interest:
1. single criterion, multiple predictor
association, including multiple regression,
analysis of variance and covariance, and
two-group discriminant analysis;
2. multiple criterion, multiple predictor
association, including canonical correlation,
multivariate analysis of variance
and covariance, and multiple discriminant
analysis;
3. analysis of variable interdependence,
including factor analysis, multidimensional
scaling, and other types of
dimension-reducing methods;
4. analysis of interobject similarity,
including cluster analysis and other
types of grouping procedures.

The first two categories involve dependence structures 
where the data matrix is partitioned into criterion and 
independent subsets; in both cases interest is focused on 
the variables. The last two categories are concerned with 
interdependence - either focusing on variables or on 
observations. Within each of the four categories, various
techniques are differentiated in terms of the type of scale 
assumed. 

In this research, we consider only the following 
techniques of multivariate analysis: 

1. Principal components analysis 

2. Discriminant analysis 


3. Cluster analysis 
II. PRINCIPAL COMPONENTS ANALYSIS


The basic idea of principal components analysis is to
describe the dispersion of an array of n points in
p-dimensional space by introducing a new set of orthogonal
linear coordinates so that the sample variances of the
given data points with respect to these derived coordinates
are in decreasing order of magnitude. Thus the first
principal component is such that the projections of the
given points onto it have maximum variance among all
possible linear coordinates; the second principal component
has maximum variance subject to being orthogonal to the
first; and so on.

Suppose that the random variables $X_1, X_2, \ldots, X_p$
of interest have a certain multivariate distribution
with finite mean vector $\mu$ and variance-covariance matrix
$\Sigma$.

From this population a sample of n independent
observation vectors has been drawn. The observations can
be written as the usual n x p data matrix.
$$X = \begin{pmatrix} x_{11} & \cdots & x_{1p} \\ \vdots & & \vdots \\ x_{n1} & \cdots & x_{np} \end{pmatrix} = \begin{pmatrix} X_1' \\ \vdots \\ X_n' \end{pmatrix} \qquad (1)$$
The estimate of $\Sigma$ will be the usual sample
variance-covariance matrix S defined as follows:

$$S = \frac{1}{n-1}A, \qquad A = \sum_{i=1}^{n}(X_i - \bar{X})(X_i - \bar{X})' \qquad (2)$$


The information we shall need for our principal
components analysis will be contained in S. However, it will
be necessary to make a choice of measures of dependence:
should we work with the variances and covariances of the
observations, and carry out the analysis in the original
units of the responses, or would a more accurate picture of
the dependence pattern be obtained if each $x_{ij}$ were
transformed to a standardized score

$$z_{ij} = \frac{x_{ij} - \bar{x}_j}{s_j}$$

and the correlation matrix R employed? The components
obtained from S and R are in general not the same, nor is
it possible to pass from one solution to the other by a
simple scaling of the coefficients.

If the responses are in widely different units (e.g.,
number of crew, weight in tons, speed in kilometers per
hour, etc.) with large differences in the magnitudes,
linear compounds of the original quantities would have little
meaning, and standardized variates and the correlation matrix
should be employed. Conversely, if the responses are
reasonably commensurable, the covariance form has a greater
statistical appeal, for the i-th principal component is
that linear compound of the responses which explains the
i-th largest portion of the total response variance, and
maximization of such total variance of standard scores is
rather artificial.
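The choice between S and R can be made concrete with a small computation. The following is a minimal sketch, assuming a hypothetical four-row data set (crew, weight, speed); the function names are illustrative and are not taken from the software discussed in this thesis.

```python
from math import sqrt

def covariance_matrix(data):
    """Sample variance-covariance matrix S (divisor n - 1)."""
    n, p = len(data), len(data[0])
    means = [sum(row[j] for row in data) / n for j in range(p)]
    return [
        [sum((row[j] - means[j]) * (row[k] - means[k]) for row in data) / (n - 1)
         for k in range(p)]
        for j in range(p)
    ]

def correlation_matrix(S):
    """Correlation matrix R: S rescaled by the standard deviations."""
    sd = [sqrt(S[j][j]) for j in range(len(S))]
    return [[S[j][k] / (sd[j] * sd[k]) for k in range(len(S))]
            for j in range(len(S))]

# Hypothetical observations: crew size, weight (tons), speed (km/h).
data = [[3, 40.0, 50.0], [4, 52.0, 48.0], [4, 60.0, 42.0], [3, 36.0, 64.0]]
S = covariance_matrix(data)
R = correlation_matrix(S)
```

In this made-up example the weight variance is several hundred times the crew variance, so components taken from S would be dominated by weight, while R puts every variable on a unit-variance footing.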


The first principal component of the complex of sample
values of the responses $X_1, X_2, \ldots, X_p$ is the
linear compound

$$Y_1 = a_{11}X_1 + a_{21}X_2 + \cdots + a_{p1}X_p \qquad (3)$$

whose coefficients $a_{i1}$ are the elements of the eigenvector
$A_1$ associated with the greatest eigenvalue $\lambda_1$ of the
sample variance-covariance matrix of the responses. The
$a_{i1}$ are unique up to multiplication by a scale factor, and
if they are scaled so that $A_1'A_1 = 1$, the eigenvalue
$\lambda_1$ is interpretable as the sample variance of $Y_1$.

Numerical representation of the first principal
component is to find the vector $A_1$ such that

$$Y_1 = A_1'X \qquad (4)$$



which maximizes the sample variance

$$s_{Y_1}^2 = \sum_{i=1}^{p}\sum_{j=1}^{p} a_{i1}a_{j1}s_{ij} = A_1'SA_1 \qquad (5)$$

for all coefficient vectors normalized so that $A_1'A_1 = 1$.


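To determine the coefficients, the normalization constraint
is introduced by means of a Lagrange multiplier $\lambda_1$
and the resulting expression is differentiated with respect
to $A_1$:

$$\frac{\partial}{\partial A_1}\left[s_{Y_1}^2 + \lambda_1(1 - A_1'A_1)\right] = \frac{\partial}{\partial A_1}\left[A_1'SA_1 + \lambda_1(1 - A_1'A_1)\right] = 2(S - \lambda_1 I)A_1 \qquad (6)$$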


The coefficients must satisfy the p simultaneous linear
equations

$$(S - \lambda_1 I)A_1 = 0 \qquad (7)$$

If the solution to these equations is to be other than the
null vector, the value of $\lambda_1$ must be chosen so that

$$|S - \lambda_1 I| = 0 \qquad (8)$$


$\lambda_1$ is thus an eigenvalue of the variance-covariance
matrix, and $A_1$ is its associated eigenvector. To determine
which of the p eigenvalues should be used, premultiply the
system of equations (7) by $A_1'$. Since $A_1'A_1 = 1$, it
follows that

$$\lambda_1 = A_1'SA_1 = s_{Y_1}^2$$

But the coefficient vector was chosen to maximize this
variance, and therefore $\lambda_1$ must be the greatest
eigenvalue of S.
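As a numerical illustration of this result, the greatest eigenvalue and its normalized eigenvector can be approximated by power iteration. This is a sketch only: power iteration is a technique chosen here for brevity, not the method used by the software discussed in this thesis, and the 2 x 2 matrix is a hypothetical covariance matrix.

```python
from math import sqrt

def mat_vec(S, a):
    """Matrix-vector product S a."""
    return [sum(S[i][j] * a[j] for j in range(len(a))) for i in range(len(S))]

def first_component(S, iterations=500):
    """Power iteration: returns (lambda_1, A_1) with A_1'A_1 = 1."""
    a = [1.0] * len(S)
    for _ in range(iterations):
        b = mat_vec(S, a)
        norm = sqrt(sum(x * x for x in b))
        a = [x / norm for x in b]
    # With A_1 normalized, the variance A_1' S A_1 equals the eigenvalue.
    lam = sum(ai * bi for ai, bi in zip(a, mat_vec(S, a)))
    return lam, a

S = [[4.0, 2.0], [2.0, 3.0]]  # hypothetical covariance matrix
lam1, A1 = first_component(S)
```

The returned pair should satisfy the system $(S - \lambda_1 I)A_1 = 0$ up to rounding, and `lam1` should agree with the larger root of the characteristic equation.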


The second principal component is that linear compound

$$Y_2 = a_{12}X_1 + a_{22}X_2 + \cdots + a_{p2}X_p \qquad (9)$$

whose coefficients have been chosen, subject to the
constraints

$$A_2'A_2 = 1, \qquad A_1'A_2 = 0 \qquad (10)$$

so that the variance of $Y_2$, $A_2'SA_2$, is a maximum.
The first constraint is merely a scaling to assure the
uniqueness of the coefficients, while the second requires
that $A_1$ and $A_2$ be orthogonal.

The coefficients of the second component can also be
found by the Lagrangian technique with two multipliers
$\lambda_2$ and $\mu$. Differentiating the resulting
expression with respect to $A_2$ gives:




$$\frac{\partial}{\partial A_2}\left[A_2'SA_2 + \lambda_2(1 - A_2'A_2) + \mu A_1'A_2\right] = 2(S - \lambda_2 I)A_2 + \mu A_1 \qquad (11)$$


If the right-hand side is set equal to 0 and premultiplied
by $A_1'$, it follows from the normalization and
orthogonality conditions that

$$2A_1'SA_2 + \mu = 0 \qquad (12)$$

Similar premultiplication of equation (7) by $A_2'$
implies, since S is symmetric, that

$$A_1'SA_2 = 0 \qquad (13)$$

and hence $\mu = 0$. The second vector must satisfy

$$(S - \lambda_2 I)A_2 = 0 \qquad (14)$$


It follows that the coefficients of the second component
are the elements of the eigenvector corresponding to
the second greatest eigenvalue. The remaining principal
components are found in their turn in the same manner from
the other eigenvectors.

Thus the j-th principal component of the sample of
p-variate observations is the linear compound

$$Y_j = a_{1j}X_1 + a_{2j}X_2 + \cdots + a_{pj}X_p \qquad (15)$$




whose coefficients are the elements of the eigenvector of
the sample variance-covariance matrix S corresponding to
the j-th largest eigenvalue $\lambda_j$. If
$\lambda_i \neq \lambda_j$, the coefficients of the i-th and
j-th components are necessarily orthogonal; if
$\lambda_i = \lambda_j$, the elements can be chosen to be
orthogonal, although an infinity of such orthogonal vectors
exists. The sample variance of the j-th component is
$\lambda_j$, and the total system variance is thus

$$\lambda_1 + \lambda_2 + \cdots + \lambda_p = \operatorname{tr} S \qquad (16)$$


The importance of the j-th component in a more parsimonious
description of the system is measured by

$$\frac{\lambda_j}{\operatorname{tr} S} \qquad (17)$$

which gives the fraction of the total variance attributable
to the j-th component.
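The ratio of each eigenvalue to tr S is easy to tabulate once the eigenvalues are known. The sketch below assumes a hypothetical set of three eigenvalues; the function name is illustrative only.

```python
from itertools import accumulate

def variance_fractions(eigenvalues):
    """Fraction of the total variance for each component, largest first."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)  # the sum of the eigenvalues equals tr S
    return [lam / total for lam in ordered]

fractions = variance_fractions([2.0, 5.0, 3.0])  # hypothetical eigenvalues
cumulative = list(accumulate(fractions))         # running share of variance
```

The running totals show how much of the system variance is retained when only the first few components are kept.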
III. DISCRIMINANT ANALYSIS


A. INTRODUCTION 

The basic idea of discriminant analysis consists of
assigning an individual from a group of individuals to one
of several known or unknown distinct populations, on the
basis of observations on several characters of the
individual or group and a sample of observations on these
characters from the populations if these are unknown.

Fisher (1936) was the first to suggest a linear 
function of variables representing different characters, 
hereafter called the linear discriminant function (discrimi- 
nator) for classifying an individual into one of two popu- 
lations. Later research extended the analysis to classifica- 
tion into one of k populations. 

For the univariate case Fisher suggested a rule which
classifies an observation x into the i-th univariate
population if

$$|x - \bar{x}_i| = \min_j |x - \bar{x}_j| \qquad (18)$$

where $\bar{x}_i$ is the sample mean based on a sample of
size $N_i$ from the i-th population. For two p-variate
populations $\pi_1$ and $\pi_2$ (with the same covariance
matrix) Fisher replaced the vector random variable by an
optimum linear combination of its components obtained by
maximizing the ratio of the difference of the expected
values of a linear combination under $\pi_1$ and $\pi_2$ to
its standard deviation. He then used his univariate
discrimination method with this optimum linear combination
of components as the random variable.
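The univariate rule amounts to assigning x to the population whose sample mean is nearest. A minimal sketch, assuming two hypothetical samples; the function name is illustrative.

```python
def classify_univariate(x, samples):
    """Return the index of the population whose sample mean is closest to x."""
    means = [sum(s) / len(s) for s in samples]
    return min(range(len(means)), key=lambda i: abs(x - means[i]))

# Hypothetical samples from two univariate populations (means 2 and 8).
samples = [[1.0, 2.0, 3.0], [7.0, 8.0, 9.0]]
near_first = classify_univariate(4.0, samples)
near_second = classify_univariate(6.0, samples)
```

In the p-variate case the same rule would be applied to the value of the linear combination rather than to a raw measurement.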

Rao (1948) considered the problem of classifying
people into one of three populations (castes) of India. He
assumed that each of the three populations could be
characterized by four variables, stature ($x_1$), sitting
height ($x_2$), nasal depth ($x_3$), and nasal height
($x_4$), of each member of the population. On the basis of
sample observations on these characters from the three
populations the problem is to classify an individual with
observation $X = (x_1, x_2, x_3, x_4)$ into one of the
three populations. He used a linear discriminator to obtain
the solution.


B. THEORY
In general, the underlying assumptions of discriminant 
analysis are: 
1. the groups being investigated are discrete 
and identifiable; 
2. each observation in each group can be 
described by a set of measurements on p 
characteristics or variables; 
3. these p variables are assumed to have
a multivariate normal distribution in each
population.
The purposes of discriminant analysis are:
1. to test for mean group differences and to
describe the overlaps among groups;
2. to construct classification schemes based
upon the set of p variables in order to
assign previously unclassified observations
to the appropriate groups.
Hence, the problem of studying the direction of group
differences is, equivalently, a problem of finding a linear
combination of the original independent variables that
shows large differences in group means. In short,
discriminant analysis is a method for determining such linear
combinations.

The first step toward determining a linear combination
of a set of variables such that several group means on this
linear combination will differ widely among themselves is
to decide on a criterion for measuring such group-mean
differences. Once a linear combination has been constructed,
there is just a single transformed variable.
Hence, the F-ratio for testing the significance of the
overall difference among several group means on a single
variable suggests an appropriate criterion.


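$$\lambda = \frac{v'Bv}{v'Wv} \qquad (19)$$

where

$v' = (v_1, v_2, \ldots, v_p)$ is a set of weights which
maximizes $\lambda$.

$B = \sum_{i=1}^{G} n_i(\bar{X}_i - \bar{X})(\bar{X}_i - \bar{X})'$
is the between-groups SSCP matrix.

$W = \sum_{i=1}^{G}\sum_{j=1}^{n_i}(X_{ij} - \bar{X}_i)(X_{ij} - \bar{X}_i)'$
is the within-groups SSCP matrix.

$X_{ij}$ is the j-th observation vector in the i-th
group.

$\bar{X}_i$ is the mean vector of the i-th group.

$\bar{X}$ is the grand mean vector of the data.

G is the number of groups.

$n_i$ is the number of observations in the i-th group.

Prime notation indicates transpose.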
The ratio $\lambda$, called the discriminant criterion, was
originally proposed by Fisher in connection with his
two-group discriminant function. Once a criterion for group
differentiation has been determined, a set of weights,
$(v_1, v_2, \ldots, v_p)$, which maximizes this criterion
should be determined. This is accomplished by taking the
partial derivative of $\lambda$ with respect to each
component $v_i$ of $v$ and setting the result equal to zero:

$$\frac{\partial\lambda}{\partial v} = \frac{2\left[(Bv)(v'Wv) - (v'Bv)(Wv)\right]}{(v'Wv)^2} = 0 \qquad (20)$$
which is equivalent to

$$(B - \lambda W)v = 0$$
$$(W^{-1}B - \lambda I)v = 0 \qquad (21)$$

This equation is of the form

$$(A - \lambda I)v = 0 \qquad (22)$$


Its solution, yielding the eigenvalues $\lambda_m$ and
associated eigenvectors $v_m$ of the matrix A, is therefore
obtained in the same way as in principal components
analysis, and this solution solves the problem of
maximizing the discriminant criterion.

In the last equation, the number of nonzero eigenvalues
of a square matrix A is equal to the rank of A. With
$W^{-1}B$ playing the role of A, the number of nonzero
eigenvalues depends on the rank of B, since the rank of the
product of two matrices cannot exceed the smaller of the two
factor matrices' ranks, and $W^{-1}$ (being nonsingular) must
be of full rank p, while the rank of B is usually smaller
than p. Thus it is possible to denote the rank of B by
r = min (G-1, p).
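A small worked case may help fix ideas. The sketch below assumes the usual between-groups and within-groups SSCP definitions of B and W, restricts itself to p = 2 variables so that the eigenproblem for $W^{-1}B$ can be solved in closed form, and uses hypothetical data for G = 2 groups. All function names are illustrative.

```python
from math import sqrt

def mean_vector(rows):
    return [sum(r[j] for r in rows) / len(rows) for j in range(2)]

def add_outer(M, d, scale=1.0):
    """Return M + scale * d d' for a 2-vector d."""
    return [[M[i][j] + scale * d[i] * d[j] for j in range(2)] for i in range(2)]

def sscp(groups):
    """Between-groups (B) and within-groups (W) SSCP matrices, p = 2."""
    grand = mean_vector([r for g in groups for r in g])
    B = [[0.0, 0.0], [0.0, 0.0]]
    W = [[0.0, 0.0], [0.0, 0.0]]
    for g in groups:
        m = mean_vector(g)
        B = add_outer(B, [m[0] - grand[0], m[1] - grand[1]], scale=len(g))
        for r in g:
            W = add_outer(W, [r[0] - m[0], r[1] - m[1]])
    return B, W

def quad(v, M):
    return sum(v[i] * M[i][j] * v[j] for i in range(2) for j in range(2))

def criterion(v, B, W):
    """Discriminant criterion v'Bv / v'Wv."""
    return quad(v, B) / quad(v, W)

def largest_root(B, W):
    """Largest eigenvalue and an eigenvector of W^{-1}B (2 x 2 case)."""
    det = W[0][0] * W[1][1] - W[0][1] * W[1][0]
    Winv = [[W[1][1] / det, -W[0][1] / det], [-W[1][0] / det, W[0][0] / det]]
    A = [[sum(Winv[i][k] * B[k][j] for k in range(2)) for j in range(2)]
         for i in range(2)]
    tr = A[0][0] + A[1][1]
    dt = A[0][0] * A[1][1] - A[0][1] * A[1][0]
    lam = (tr + sqrt(max(tr * tr - 4.0 * dt, 0.0))) / 2.0
    v = [A[0][1], lam - A[0][0]]          # from the first row of (A - lam I)v = 0
    if abs(v[0]) + abs(v[1]) < 1e-12:     # fall back to the second row
        v = [lam - A[1][1], A[1][0]]
    return lam, v

# Hypothetical data: two groups, three observations each, two variables.
groups = [[[1.0, 2.0], [2.0, 1.0], [3.0, 3.0]],
          [[5.0, 4.0], [6.0, 6.0], [7.0, 5.0]]]
B, W = sscp(groups)
lam, v = largest_root(B, W)
```

With two groups the rank of B is min(G-1, p) = 1, so only one eigenvalue is nonzero, and the criterion evaluated at the eigenvector equals that eigenvalue.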

From the fact that the eigenvalues $\lambda_m$ are the
values assumed by the discriminant criterion for linear
combinations using the elements of the corresponding
eigenvectors $v_m$ as combining weights, it is clear that
the eigenvector $v_1' = (v_{11}, v_{12}, \ldots, v_{1p})$
provides a set of weights such that

$$Y_1 = v_{11}X_1 + v_{12}X_2 + \cdots + v_{1p}X_p \qquad (23)$$

has the largest discriminant-criterion, $\lambda_1$,
achievable by any linear combination of the p independent
variables.
What are the properties of the remaining eigenvectors,
$v_2, v_3, \ldots, v_r$? The second discriminant function
$Y_2 = v_{21}X_1 + v_{22}X_2 + \cdots + v_{2p}X_p$, whose
weights are the elements of the eigenvector $v_2$ associated
with the second largest eigenvalue $\lambda_2$ of $W^{-1}B$,
has the largest discriminant-criterion among those linear
combinations of the $X_i$ that are uncorrelated with the
first discriminant function in the total sample. The proof
is analogous to that for principal components analysis. Each
discriminant function has a relative (or conditional)
maximum value for its discriminant criterion. Therefore, it
need only be shown that $Y_2$ is uncorrelated with $Y_1$.
Noting that this correlation is proportional to $v_1'Tv_2$
(where T = W + B), we have to prove that $v_1'Tv_2 = 0$.
Now

$$(B - \lambda_i W)v_i = 0 \quad \text{for each } i \qquad (24)$$

hence,




premultiplying these equations (for i = 1 and i = 2) by
$v_2'$ and $v_1'$ respectively, we find

$$v_2'Bv_1 = \lambda_1 v_2'Wv_1$$
$$v_1'Bv_2 = \lambda_2 v_1'Wv_2 \qquad (25)$$

Taking the transpose of both sides of the first equation (B
and W are symmetric),

$$v_1'Bv_2 = \lambda_1 v_1'Wv_2$$

thus

$$\lambda_1 v_1'Wv_2 = \lambda_2 v_1'Wv_2$$
$$(\lambda_1 - \lambda_2)\,v_1'Wv_2 = 0$$

and since $\lambda_1 \neq \lambda_2$, $v_1'Wv_2 = 0$. By
(25), $v_1'Bv_2 = 0$ as well; therefore
$v_1'Tv_2 = v_1'(W + B)v_2 = 0$, which means $Y_1$ and $Y_2$
are uncorrelated, and $Y_2$ has this property: its
discriminant-criterion value, $\lambda_2$, is the largest
achievable by any linear combination of X's that is
uncorrelated (in the total sample) with $Y_1$. Similarly


$$Y_3 = v_{31}X_1 + v_{32}X_2 + \cdots + v_{3p}X_p \qquad (26)$$

has the largest possible discriminant-criterion value
($\lambda_3$) among all linear combinations of the X's that
are uncorrelated with both $Y_1$ and $Y_2$; and so on until
$Y_r$, using the elements of $v_r$ as weights, has the
largest possible discriminant-criterion value among linear
combinations that are uncorrelated with all the preceding
linear combinations $Y_1, Y_2, \ldots, Y_{r-1}$. The linear
combinations $Y_1, Y_2, \ldots, Y_r$ are called the first,
second, ..., r-th (linear) discriminant functions for
optimally differentiating among the G given groups.

The situation here is reminiscent of principal
components analysis. There, the dimension corresponding to the
first component had maximum variance; the second-component
dimension had maximum variance among those uncorrelated with
the first; and so on. In discriminant analysis, the ratio
of between- to within-groups sums-of-squares merely takes the
place of variance as the criterion in determining the
successive dimensions. However, an important difference
between the dimensions identified in discriminant analysis
and those in component analysis is that the former are
generally not mutually orthogonal in the test space, even
though they are uncorrelated. That is, the axes representing
the discriminant functions are not a subset of axes obtainable
by rigid rotation of the original system of p axes; the
discriminant rotation is an oblique rotation.

Just as in the principal components analysis, the 
dimensions represented by the discriminant functions may be 
interpreted meaningfully. Even if they are not, it may be 
possible to achieve parsimony by reducing the dimensionality 


of the space needed to describe group differences. In
seeking to interpret the discriminant functions, the goal is
to determine which of the original p variables contribute
most to each function. For this purpose, comparison of the
relative magnitudes of the combining weights as given by the
elements of each eigenvector of $W^{-1}B$ is inappropriate
because these are weights to be applied to the variables in
raw-score scales, and are hence affected by the particular
unit used for each variable.

To eliminate the spurious effects of units of
measurement on the magnitudes of combining weights,
standardized variables should be used.

The relative magnitudes of these standardized weights may
be assessed by multiplying each raw-score weight by the
standard deviation of the corresponding variable as computed
from the within-groups SSCP (Sum of Squares, Cross Product)
matrix. This amounts to multiplying each element of a given
eigenvector $v_m$ by the square root of the corresponding
diagonal element of W. Thus, for each m, define

$$v_{mi}^{*} = \sqrt{w_{ii}}\; v_{mi}, \qquad i = 1, 2, \ldots, p \qquad (27)$$

as the standardized discriminant weights. The relative
contribution of the i-th variable to the m-th discriminant
function may then be gauged by the magnitude of $v_{mi}^{*}$
in comparison with the other weights $v_{mj}^{*}$.
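The rescaling by the square roots of the diagonal elements of W can be sketched in a few lines. The eigenvector and W matrix below are hypothetical values chosen to show how a small raw weight on a large-variance variable can still be the less important one after standardization.

```python
from math import sqrt

def standardized_weights(v, W):
    """Multiply each raw weight by the square root of the matching diagonal of W."""
    return [sqrt(W[i][i]) * v[i] for i in range(len(v))]

v_raw = [0.02, 1.5]                  # hypothetical raw-score weights
W = [[400.0, 10.0], [10.0, 4.0]]     # hypothetical within-groups SSCP matrix
v_std = standardized_weights(v_raw, W)
```

Only the diagonal of W enters the calculation; the off-diagonal elements are ignored.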
Up to this point, it has been shown that the
dimensionality of the discriminant space is equal to the
number of nonzero eigenvalues of $W^{-1}B$, which is the
smaller of the two numbers, G-1 and p. It may often happen
that the number of significant discriminant dimensions is
even smaller. That is, not all of the discriminant functions
may represent dimensions along which statistically
significant group differences occur.


C. SIGNIFICANCE TEST IN DISCRIMINANT ANALYSIS

A basic quantity in testing the significance of the
overall difference among several group centroids (mean
vectors) is the ratio of the determinants of the
within-groups and the total SSCP matrices, known as Wilks'
$\Lambda$ criterion.

$$\Lambda = \frac{|W|}{|T|} \qquad (28)$$


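Motivation for use of this equation may be seen as follows:

$$\frac{1}{\Lambda} = \frac{|T|}{|W|} = |W^{-1}(W + B)| = (1 + \lambda_1)(1 + \lambda_2)\cdots(1 + \lambda_r) \qquad (29)$$

where $\lambda_1, \lambda_2, \ldots, \lambda_r$ are the
nonzero eigenvalues of $W^{-1}B$.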
Consequently, Bartlett's V statistic for testing the
significance of an observed value of $\Lambda$ can be
expressed as

$$\begin{aligned} V &= -[N - 1 - (p + G)/2]\,\ln\Lambda \\ &= [N - 1 - (p + G)/2]\,\ln[(1 + \lambda_1)(1 + \lambda_2)\cdots(1 + \lambda_r)] \\ &= [N - 1 - (p + G)/2]\sum_{m=1}^{r}\ln(1 + \lambda_m) \end{aligned} \qquad (30)$$

where N is the total number of observations. This statistic
is distributed approximately chi-square with
p(G-1) degrees of freedom.
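Bartlett's V can be computed directly from the nonzero eigenvalues. The sketch below assumes N denotes the total number of observations and uses hypothetical eigenvalues and sample sizes; the function name is illustrative.

```python
from math import log

def bartlett_v(eigenvalues, N, p, G):
    """Return (V, additive components of V, degrees of freedom p(G-1))."""
    factor = N - 1 - (p + G) / 2
    components = [factor * log(1 + lam) for lam in eigenvalues]
    return sum(components), components, p * (G - 1)

# Hypothetical case: N = 30 observations, p = 4 variables, G = 3 groups.
V, parts, df = bartlett_v([1.5, 0.2], N=30, p=4, G=3)
```

The value V would then be compared with a chi-square percentile on `df` degrees of freedom taken from tables.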

Because of the uncorrelatedness of the successive
discriminant functions, the successive terms
$\ln(1 + \lambda_m)$ in the last expression above are
statistically independent (assuming multivariate normality
of the original p variables). As a result, the additive
components of V are each approximately distributed as a
chi-square variate. More specifically, the m-th component,

$$V_m = [N - 1 - (p + G)/2]\,\ln(1 + \lambda_m) \qquad (31)$$

is approximately chi-square with p + G - 2m degrees of
freedom. It may be readily verified that the sum of the
numbers of degrees of freedom (n.d.f.) of the r components,
that is, (p + G - 2) + (p + G - 4) + ... + (p + G - 2r),
is equal to p(G - 1) regardless of whether r = G - 1 or
r = p.
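The degrees-of-freedom identity can be checked directly. The short derivation below is added for illustration and uses only the component degrees of freedom p + G - 2m stated above.

```latex
\begin{aligned}
\sum_{m=1}^{r}(p + G - 2m) &= r(p + G) - r(r + 1) = r\,(p + G - r - 1),\\
r = G - 1:\quad & (G - 1)\,\bigl(p + G - (G - 1) - 1\bigr) = p\,(G - 1),\\
r = p:\quad & p\,(p + G - p - 1) = p\,(G - 1).
\end{aligned}
```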

Consequently, when we cumulatively subtract $V_1$, $V_2$,
and so on from V, the remainder each time is also a
chi-square variate; and these successive remainders become
appropriate statistics for testing whether the residual
discrimination after removing the first discriminant
function, the first and second discriminant functions, and
so forth, is statistically significant. The successive
test statistics and their n.d.f.'s may be summarized as
follows:
Residual After              Approximate
Removing                    χ² Statistic                n.d.f.

First discriminant          V - V_1                     p(G-1) - (p+G-2)
function                                                = (p-1)(G-2)
First two discriminant      V - V_1 - V_2               (p-1)(G-2) - (p+G-4)
functions                                               = (p-2)(G-3)
First three discriminant    V - V_1 - V_2 - V_3         (p-2)(G-3) - (p+G-6)
functions                                               = (p-3)(G-4)
...
First s discriminant        V - V_1 - V_2 - ... - V_s   (p-s)(G-(s+1))
functions


As soon as the residual after removing the first s
discriminant functions becomes smaller than the prescribed
percentile point (that is, the 100(1 - α)th percentile) of
the appropriate chi-square distribution, we may conclude
that only the first s discriminant functions are
significant at that α level. If the number of significant
discriminant functions thus found is smaller than r
(as will often be the case), we will have effected a further
reduction in the dimensionality of the space required to
describe the differences among the G groups from which
our sample groups were drawn. The remaining r-s dimensions
may be regarded as immaterial for population differentiation,
since our sample differences along these dimensions can be
attributed to sampling error.
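The sequence of residual statistics and their degrees of freedom can be tabulated mechanically. This sketch assumes the additive components of V have already been computed; the component values are hypothetical, and the comparison with chi-square percentiles is left to tables.

```python
def residual_tests(components, p, G):
    """List (s, residual statistic, n.d.f.) after removing the first s functions."""
    total = sum(components)
    rows = []
    removed = 0.0
    for s, v_s in enumerate(components, start=1):
        removed += v_s
        rows.append((s, total - removed, (p - s) * (G - (s + 1))))
    return rows

def residual_df_by_subtraction(p, G, s):
    """Same n.d.f. obtained by subtracting p + G - 2m for m = 1, ..., s."""
    return p * (G - 1) - sum(p + G - 2 * m for m in range(1, s + 1))

# Hypothetical components V_1, V_2, V_3 for p = 5 variables and G = 4 groups.
rows = residual_tests([20.0, 5.0, 1.0], p=5, G=4)
```

Each residual in `rows` would be judged against the chi-square percentile for its own n.d.f.; testing stops at the first s for which the residual is not significant.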
IV. CLUSTER ANALYSIS


A. ORIGIN AND THEORY 

Clustering is the grouping of similar objects. The
principal functions of clustering are to name, to display,
to summarize, to predict, and to aid in interpretation of
data with many dimensions. Clustering techniques were
first developed in the field of biological taxonomy.
Clustering is one of several methodologies included in the
broader category called classification.

The cluster analysis problem is the last step we
consider in the progression of category sorting problems.
In discriminant analysis some part of the structure
is known and missing information is estimated from labeled
samples; the operational objective there is to
classify new observations, that is, to recognize them as
members of one category or another. In cluster analysis
little or nothing is known about the category structure. All
that is available is a collection of observations whose
category memberships are unknown. We seek to discover a
category structure which fits the observations. The problem
may be stated as one of finding the "natural groups", which
means to sort the observations into groups such that the
degree of "natural association" is high among members of the
same group and low between members of different groups.
Cluster analysis techniques have been applied in many
fields of study. The literature is both voluminous and
diverse, the terminology differing from one field to
another. "Numerical taxonomy" is frequently substituted
for cluster analysis among biologists, botanists, and
ecologists, while some social scientists may refer to
"typology". Other frequently encountered terms are pattern
recognition and partitioning. While discriminant analysis
has been studied by statisticians for nearly 45 years,
cluster analysis has only recently come to statistical
notice. Any method which partitions a set of objects into
subsets on the basis of measurements taken on every object
qualifies as a clustering method.

Most of the well-known clustering techniques fall into
one of two main categories: (1) hierarchical and (2)
nonhierarchical (partitioning). The former is one in which
every cluster obtained at any stage is a merger of clusters
at previous stages. The nonhierarchical procedures, however,
form new clusters by lumping and splitting old ones. We
consider both categories shortly.

In a geometric sense, every observation may be viewed 
as a point in p-dimensional Euclidean space. This swarm of 
data points may contain dense regions or "clouds" of data 
points which are separable from other regions containing a 
low density of points. These denser regions constitute what 
are known as clusters. In one- and two-dimensional cases, it
is easy to visualize and to detect the clusters from scatter
plots, assuming that the clusters exist. In higher
dimensions, clustering becomes extremely difficult without
the aid of a computer.

Mathematical clustering techniques usually require a
measure of similarity to be defined for every pairwise
combination of the entities to be clustered. In order to
solve the cluster problem, it is desirable to define the
terms "similarity" and "difference" in a quantitative
fashion. A researcher would assign two observations to
the same group if the distance between them is sufficiently
small, or to different clusters if this distance is
sufficiently large.

At this point, two questions may be raised. The
first one is "how do we measure the distance between the
observations?" and the second one is "how small is small
enough, and how large is large enough?" These will be
discussed in the following sections.


B. MEASURES OF DISTANCE
1. General
Let $E^p$ be a symbolic representation for a
measurement in p-dimensional space and let X, Y, and Z be
any three points in $E^p$. Then any nonnegative real-valued
function D(X,Y) satisfying the following conditions
qualifies as a distance function (or metric):

D(X,Y) = 0 if and only if X = Y
D(X,Y) = D(Y,X)
D(X,Y) ≤ D(X,Z) + D(Y,Z)
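Whether a candidate function behaves as a metric can be spot-checked on a finite set of points. The sketch below assumes the standard metric conditions (nonnegativity, zero distance only between identical points, symmetry, and the triangle inequality); passing the check on a sample does not prove the property in general. The function names are illustrative.

```python
from itertools import product
from math import sqrt

def euclidean(x, y):
    return sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def squared_euclidean(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y))

def is_metric_on(points, dist, tol=1e-12):
    """Check the metric conditions for every triple drawn from points."""
    for x, y, z in product(points, repeat=3):
        d_xy = dist(x, y)
        if d_xy < 0:
            return False                      # nonnegativity
        if (d_xy <= tol) != (x == y):
            return False                      # zero only for identical points
        if abs(d_xy - dist(y, x)) > tol:
            return False                      # symmetry
        if d_xy > dist(x, z) + dist(y, z) + tol:
            return False                      # triangle inequality
    return True

plane = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0), (1.0, -2.0)]
line = [(0.0,), (1.0,), (2.0,)]
euclid_ok = is_metric_on(plane, euclidean)
squared_ok = is_metric_on(line, squared_euclidean)
```

On the three collinear points the squared Euclidean distance violates the triangle inequality (4 > 1 + 1), which is why squared distances are not themselves metrics.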


Many clustering algorithms assume such distances given and
set about constructing clusters of objects within which the
distances are small. The choice of distance function is no
less important than the choice of variables to be used in
the study. A serious difficulty in choosing a distance lies
in the fact that a clustering structure is more primitive
than a distance function and that knowledge of clusters
changes the choice of distance function. Thus a variable
that distinguishes well between two established clusters
should be given more weight in computing distances than a
"junk" variable that distinguishes badly.


2. Euclidean Distance
The Euclidean distance between the I-th and K-th
observations of a data matrix X is defined as

$$D(I,K) = \left[ \sum_{J=1}^{p} \bigl( X(I,J) - X(K,J) \bigr)^2 \right]^{1/2} \tag{32}$$

where J indexes the J-th variable. In one-, two-, or three-
dimensional space, this is just the "straight line" distance
between the vectors corresponding to the I-th and K-th
observations. When the variables are measured in different
units, it is necessary to prescale the variables to make
their values comparable or, equivalently, to compute a
weighted Euclidean distance,

$$D(I,K) = \left[ \sum_{J=1}^{p} W(J) \bigl( X(I,J) - X(K,J) \bigr)^2 \right]^{1/2} \tag{33}$$


This form of distance is not necessary if all variables are
measured on the same scale. However, even in this case,
weights might be used to increase or decrease the importance
of some variables. Various weighting schemes have been
utilized in practice. One common weighting scheme lets
W(J) be the reciprocal of the variance of variable J.
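
As an illustration of equations (32) and (33), the following minimal sketch computes the unweighted distance and the reciprocal-variance weighted distance in plain Python. The function names and the three-row data set are hypothetical and are not taken from the thesis software; the sample variance with divisor n - 1 is also an assumption.

```python
import math

def euclidean(x, y):
    """Unweighted Euclidean distance between two observations, eq. (32)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def weighted_euclidean(x, y, w):
    """Weighted Euclidean distance with one weight per variable, eq. (33)."""
    return math.sqrt(sum(wj * (a - b) ** 2 for wj, a, b in zip(w, x, y)))

def reciprocal_variance_weights(data):
    """Return W(J) = 1 / (sample variance of variable J) for each column."""
    n = len(data)
    weights = []
    for column in zip(*data):
        mean = sum(column) / n
        variance = sum((v - mean) ** 2 for v in column) / (n - 1)
        weights.append(1.0 / variance)
    return weights

# Hypothetical data: variable 1 on a large scale, variable 2 on a small one.
data = [[40.0, 3.0], [50.0, 3.5], [60.0, 4.0]]
w = reciprocal_variance_weights(data)
d_plain = euclidean(data[0], data[1])
d_weighted = weighted_euclidean(data[0], data[1], w)
```

In this example the unweighted distance is dominated by the first variable, while the weighted distance gives both variables an equal contribution.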

A general class of squared distance functions is
provided by utilizing positive definite quadratic forms.
Specifically, if $\beta$ represents a p-dimensional observation
to be assigned to one of g groups, then to measure the
squared distance between the observation $\beta$ and the
centroid (mean vector) $\bar{x}_i$ of the i-th group one may
consider the function

$$D_i^2 = (\beta - \bar{x}_i)' M (\beta - \bar{x}_i)$$

where M is a positive definite matrix to ensure that
$D_i^2 \ge 0$. Different distance functions are represented by
different choices of the matrix M. When M = I (the
identity matrix) the resulting metric is the standard
Euclidean distance. Distances with the Euclidean metric are
shown in Figure 1a. The variance within the data may make
the unweighted Euclidean metric inappropriate. As shown in
Figure 1b, where X has a larger variance than Y,
one may wish to weight a deviation in the X direction less
than an equal deviation in the Y direction. This is a
weighted Euclidean distance function which makes points A
and B equidistant from the origin. In this case, the
matrix M is diagonal, with elements equal to the reciprocals
of the variances of the different variables.

Extending this idea further, it may be possible to
consider the covariance among variables as well. Figure 1c
shows how the axes may be rotated so that the major axis is
oriented in a direction reflecting the positive correlation
between X and Y. Again, points on the same
ellipse are considered equidistant from the origin. The
matrix M in this case is the inverse of the covariance
matrix.

Further extension of this concept leads to a
generalized distance function. If $C_i$ represents the
covariance matrix of the i-th cluster, then the distance
function between an observation $\beta$ and the centroid
$\bar{x}_i$,

$$D_i^2 = (\beta - \bar{x}_i)' C_i^{-1} (\beta - \bar{x}_i)$$

uses the appropriate covariance structure when determining
the distance to a particular cluster centroid. Since $C_i$
changes to reflect the dispersion internal to each particular
cluster, the use of this metric exploits differences in the
dispersion characteristics of the different groups. In
Figure 1d, note how a new observation (marked in the figure) is
closer to the centroid of group one (G1) in terms of
Euclidean distance but is more likely to be assigned to
group two (G2) when using the $C_i$ matrix.

Figure 1. Euclidean Distance. (Panel 1c: generalized squared
distance measure. Panel 1d: classification when within-group
dispersions are different.)


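3. Mahalanobis Distance
Another choice for the M matrix in the quadratic form
above is $M = P^{-1}$, where P represents the pooled within-groups
covariance matrix of all the clusters,

$$P = \frac{1}{N - g} \sum_{i=1}^{g} (n_i - 1)\, C_i \tag{34}$$

where $n_i$ is the number of observations in the i-th group.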


This distance is the well-known Mahalanobis distance. Note
that P does not change from group to group. To ensure
the nonsingularity of P it must be true that $p \le N - g$,
where N represents the total number of observations over
all groups. Rewriting the distance,

$$D_i^2 = (\beta - \bar{x}_i)' P^{-1} (\beta - \bar{x}_i) \tag{35}$$

defines a distance between the vectors $\beta$ and $\bar{x}_i$ with
common covariance matrix P. The Mahalanobis distance
function adjusts for both the scale of measurement of the
variables and the covariation among the variables. Use of this
metric is equivalent to computing Euclidean distances on
variables transformed to their principal components, each
scaled to unit variance. This metric is invariant under any
nonsingular linear transformation of the original
variables. To see this, consider the transformation


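$$Y = BX \tag{36}$$

where B is nonsingular, and let $D(Y_i, Y_j)$ represent the
Mahalanobis distance between $Y_i$ and $Y_j$. Since the
covariance matrix of the transformed variables is
$P_y = B P_x B'$,

$$\begin{aligned}
D(Y_i, Y_j) &= (Y_i - Y_j)' P_y^{-1} (Y_i - Y_j) \\
            &= (BX_i - BX_j)' P_y^{-1} (BX_i - BX_j) \\
            &= (X_i - X_j)' B' P_y^{-1} B (X_i - X_j) \\
            &= (X_i - X_j)' B' (B P_x B')^{-1} B (X_i - X_j) \\
            &= (X_i - X_j)' P_x^{-1} (X_i - X_j) \\
            &= D(X_i, X_j)
\end{aligned}$$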
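
The invariance shown above can be checked numerically. The sketch below is an illustration only: it works in two dimensions with hand-written matrix helpers, and the covariance matrix `P_x`, the transformation `B`, and the two observations are hypothetical values chosen for the example.

```python
def mat_mul(a, b):
    """Product of two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(a):
    return [list(row) for row in zip(*a)]

def inverse_2x2(m):
    """Inverse of a nonsingular 2 x 2 matrix."""
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    return [[m[1][1] / det, -m[0][1] / det],
            [-m[1][0] / det, m[0][0] / det]]

def mahalanobis_sq(x, y, cov):
    """Squared Mahalanobis distance (x - y)' cov^{-1} (x - y)."""
    d = [[a - b] for a, b in zip(x, y)]          # column vector x - y
    return mat_mul(mat_mul(transpose(d), inverse_2x2(cov)), d)[0][0]

def transform(b, x):
    """Apply the linear transformation Y = BX to one observation."""
    return [sum(b[i][k] * x[k] for k in range(len(x))) for i in range(len(b))]

P_x = [[4.0, 1.0], [1.0, 2.0]]                   # hypothetical covariance matrix
B = [[2.0, 1.0], [0.0, 3.0]]                     # hypothetical nonsingular B
P_y = mat_mul(mat_mul(B, P_x), transpose(B))     # covariance of Y = BX

x_i, x_j = [1.0, 2.0], [3.0, -1.0]
d_x = mahalanobis_sq(x_i, x_j, P_x)
d_y = mahalanobis_sq(transform(B, x_i), transform(B, x_j), P_y)
```

The two squared distances agree up to rounding, as the derivation predicts.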


Some other common metrics are listed below, where the sums
run over the variables k:

i. $L_1$ norm (City Block)

$$D(X_i, X_j) = \sum_{k} \left| x_{ik} - x_{jk} \right|$$

ii. $L_p$ norm (Minkowski Metrics)

$$D(X_i, X_j) = \left[ \sum_{k} \left| x_{ik} - x_{jk} \right|^{p} \right]^{1/p}$$


C. HIERARCHICAL CLUSTERING


1. General

The previously discussed distance measures may be
used to construct a similarity matrix describing the strength
of all pairwise relationships among the entities (variables
or data units) in the data set. The methods of hierarchical
cluster analysis operate on this similarity matrix to
construct a tree depicting specified relationships among the
entities. As shown in Figure 2, the branches on the left
each represent one entity while the root represents the
entire collection of entities. Moving through the tree
from the branches toward the root depicts increasing
aggregation of the entities into clusters. Hierarchical
clustering methods which build a tree from branches to
root often are called agglomerative methods; methods which
split the collection from root to branches are called
divisive methods.

Once a tree is constructed for N entities, the
analyst may choose from as many as N sets of clusters.
These clusters are nested. From the agglomerative view,
when two entities are merged they are joined together
permanently and considered as one entity for later merges; from
the divisive view, when a group of entities is split into
two parts, the parts are separated permanently and may be
treated independently for the remainder of the analysis.

Figure 2. Tree for Hierarchical Clustering

Herein lie both the strength and weakness of
hierarchical methods: by taking early decisions as
permanent, the number of possibilities that need be examined is
reduced greatly as compared with complete enumeration;
but this same convention precludes discovering early
mistakes or capitalizing on later opportunities.

There are three major hierarchical clustering concepts:

1. Linkage methods
2. Centroid methods
3. Error sum of squares or variance methods

All of these methods are suitable for clustering data units.
However, only the linkage methods are considered in this
research.


2. The General Agglomerative Procedure

Let $S_{ij}$ be the similarity between entities i
and j as defined by one of the distance measures previously
discussed. Assuming that the similarity is symmetric, the
complete schedule of similarities for all $\tfrac{1}{2}N(N - 1)$
possible pairwise combinations of entities may be arrayed
in a lower triangular similarity matrix as in Figure 3.
The $S_{ij}$ entries are nonnegative. This limitation is of
consequence only for correlation and the cosine of the angle
between vectors; the distinction between positive and
negative association cannot be utilized in these clustering
methods. A simple remedy is to use the absolute value or the
square of the measure if it can assume negative values.

Figure 3. Lower Triangular Similarity Matrix

Once the matrix is defined, the process of clustering entities
is almost trivially simple. The general procedure for
agglomerative clustering on a data matrix is as follows:
(1) Begin with N clusters, each consisting of
exactly one entity. Let the clusters be
labeled with the numbers 1 through N.

(2) Search the similarity matrix for the most
similar pair of clusters. Let the chosen
clusters be labeled p and q and let
their associated similarity be $S_{pq}$, p > q.

(3) Reduce the number of clusters by 1 through
merger of clusters p and q. Label the
product of the merger q and update the
similarity matrix entries in order to
reflect the revised similarities between
cluster q and all other existing clusters.
Delete the row and column of S pertaining
to cluster p.

(4) Perform steps 2 and 3 a total of N-1 times
(at which point all entities will be in one
cluster). At each step record the identity
of the clusters which are merged and the value
of similarity between them in order to have
a complete record of the results.


Different agglomerative methods are implemented by
varying the procedures used for defining the most similar
pair at step 2 and for updating the revised similarity
matrix at step 3. The similarity matrix is a given array
of numbers. The numerical execution of the clustering
procedures is completely independent of how the similarity
values were generated or whether the entities to be
clustered are variables or data units. However, it is
necessary to make a direct distinction between distance-like
measures (the smallest values correspond to the most similar
pairs) and correlation-like measures (the largest values
correspond to the most similar pairs); the essential
difference is whether the search for the most similar pair
involves seeking the minimum or maximum entry in the
similarity matrix.
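
The four steps of the general procedure can be sketched in a few lines of Python. This is an illustrative sketch rather than the IMSL routine used in the thesis: the function name `agglomerate`, the dictionary storage of the lower triangle, the tie-breaking rule (first pair found), and the 4 x 4 distance matrix `D` are all assumptions made for the example. The `update` argument is the rule applied at step 3, so passing `min` or `max` yields two different agglomerative methods on distance-like data.

```python
def agglomerate(matrix, update, distance_like=True):
    """General agglomerative procedure on a symmetric similarity matrix.

    matrix        : full N x N list of lists (only the lower triangle is read)
    update        : function (S_pr, S_qr) -> revised similarity S_qr
    distance_like : True if small values mean "most similar"
    Returns the merge record as a list of (p, q, S_pq) tuples with p > q.
    """
    n = len(matrix)
    # Step 1: N single-entity clusters labeled 1..N; store the lower triangle.
    s = {(i, j): matrix[i - 1][j - 1]
         for i in range(1, n + 1) for j in range(1, i)}
    active = list(range(1, n + 1))
    pick = min if distance_like else max
    record = []
    for _ in range(n - 1):                                # Step 4: N-1 merges
        # Step 2: most similar pair (p, q), p > q, among existing clusters.
        p, q = pick(((a, b) for a in active for b in active if a > b),
                    key=lambda pair: s[pair])
        record.append((p, q, s[(p, q)]))
        # Step 3: merge p into q and revise similarities to other clusters.
        for r in active:
            if r not in (p, q):
                key_p = (max(p, r), min(p, r))
                key_q = (max(q, r), min(q, r))
                s[key_q] = update(s[key_p], s[key_q])
        active.remove(p)                                  # drop row/column p
    return record

# Hypothetical distance matrix for four entities.
D = [[0, 2, 6, 10],
     [2, 0, 5, 9],
     [6, 5, 0, 4],
     [10, 9, 4, 0]]
merges_min = agglomerate(D, min)   # smallest-distance update rule
merges_max = agglomerate(D, max)   # largest-distance update rule
```

Both runs merge the same pairs in the same order here, but the recorded similarity of the final merge differs (5 versus 10), which shows that only the update rule at step 3 distinguishes the two methods.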
3. Single Linkage

The method of single-linkage cluster analysis is
the simplest of all hierarchical techniques. At each stage,
after clusters p and q have been merged, the similarity
between the new cluster (labeled t) and some other cluster r
is determined as follows:

(1) If $S_{ij}$ is a distance-like measure,

$$S_{tr} = \min(S_{pr}, S_{qr}) \tag{37}$$

The quantity $S_{tr}$ is the distance between the two closest
members of clusters t and r. If clusters t and r
were to be merged, then for any entity in the resulting
cluster the distance to its nearest neighbor would be at
most $S_{tr}$.

(2) If $S_{ij}$ is a correlation-like measure,

$$S_{tr} = \max(S_{pr}, S_{qr}) \tag{38}$$

The quantity $S_{tr}$ is the similarity between the two most
similar entities in clusters t and r. If clusters t
and r were to be merged, then for any entity in the
resulting cluster there would be at least one other entity
in the same cluster such that the pair would have a similarity
at least as large as $S_{tr}$.

The method is known as single linkage because clusters
are joined at each stage by the single shortest or strongest
link between them. Since the updating process involves
choosing only the minimum or maximum, single-linkage clustering
is invariant to any transformation which leaves the
ordering of the similarities unchanged; that is, any monotonic
transformation.


4. Complete Linkage

The complete-linkage method is related to the single-
linkage method and is no more difficult to execute. At each
stage, after clusters p and q have been merged, the
similarity between the new cluster (labeled t) and some
other cluster r is determined as follows:

(1) If $S_{ij}$ is a distance-like measure,

$$S_{tr} = \max(S_{pr}, S_{qr}) \tag{39}$$

The quantity $S_{tr}$ is the distance between the most distant
members of clusters t and r. If clusters t and r
were merged, then every entity in the resulting cluster
would be no farther than $S_{tr}$ from every other entity in
the cluster. The value of $S_{tr}$ is the diameter of the
smallest sphere which can enclose the cluster resulting from
the merger of clusters t and r.

(2) If $S_{ij}$ is a correlation-like measure,

$$S_{tr} = \min(S_{pr}, S_{qr}) \tag{40}$$

The quantity $S_{tr}$ is the similarity between the two most
dissimilar entities in clusters t and r. If clusters
t and r were to be merged, then every entity in the
resulting cluster would have a similarity of at least $S_{tr}$
with every other entity in the cluster.

The method is called complete linkage because all
entities in a cluster are linked to each other at some
maximum distance or minimum similarity. Such a cluster
is called a "maximally connected subgraph" in graph theory.
In contrast to the single-linkage method, interpretation
of the clusters can be made only in terms of the
relationships within individual clusters; there is no particularly
useful interpretation involving the differences between
clusters. Like the single-linkage method, complete-linkage
cluster analysis is invariant to monotonic transformations
of the similarity measure. Johnson (1967) discusses this
property in both single and complete linkage methods.
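
The four update rules of equations (37) through (40) differ only in whether the smaller or the larger of the two old similarities is kept. The sketch below collects them in one hypothetical helper, `updated_similarity`, and checks the invariance property on a single pair of values: applying an increasing transformation before or after the update gives the same result. The helper name, its arguments, and the numbers are assumptions for illustration.

```python
def updated_similarity(s_pr, s_qr, method, distance_like=True):
    """Similarity between merged cluster t (p and q) and another cluster r."""
    if method == "single":
        # Eq. (37) for distance-like data, eq. (38) for correlation-like data.
        return min(s_pr, s_qr) if distance_like else max(s_pr, s_qr)
    if method == "complete":
        # Eq. (39) for distance-like data, eq. (40) for correlation-like data.
        return max(s_pr, s_qr) if distance_like else min(s_pr, s_qr)
    raise ValueError("method must be 'single' or 'complete'")

def square(s):
    """An increasing (monotonic) transformation of nonnegative values."""
    return s * s

s_pr, s_qr = 2.0, 5.0                     # hypothetical distances to cluster r
for method in ("single", "complete"):
    before = square(updated_similarity(s_pr, s_qr, method))
    after = updated_similarity(square(s_pr), square(s_qr), method)
    assert before == after                # update commutes with the transform
```

Because taking a minimum or maximum depends only on the ordering of the values, the same merge sequence results whether the raw or the transformed similarities are clustered.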


D. NONHIERARCHICAL CLUSTERING

Nonhierarchical clustering methods are designed to
cluster data units into a single classification of g
clusters, where g either is specified a priori or is
determined as a part of the clustering method. The central
idea in most of these methods is to choose some initial
partition of the data units and then alter cluster
memberships so as to obtain a better partition. The various
algorithms which have been proposed differ as to what
constitutes a "better partition" and what methods may be used
for achieving improvements.

The broad concept for these methods is very similar 
to that underlying the steepest descent algorithms used 
for unconstrained optimization in nonlinear programming. 
Such algorithms begin with an initial point and then 
converge to a local optimum, moving one step at a time, 
the value of the objective function improving at each step. 

The methods of nonhierarchical clustering typically
may be used with much larger problems than the hierarchical
methods because it is not necessary to calculate and store
the similarity matrix; it is not even necessary to store
the data set. In general, the data units are processed
serially and can be read from tape or disk as needed. This
characteristic makes it possible, at least in principle, to
cluster arbitrarily large collections of data units.

In this research, we consider only the partitioning
method known as "K-MEANS", which was developed by MacQueen
(15). He used the term "K-MEANS" to denote the process of
assigning each data unit to that cluster (of k clusters)
with the nearest centroid (mean vector). The cluster
centroid changes with each transfer of an observation.
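
A single serial pass of this kind of procedure can be sketched as follows. The sketch assumes that the first k data units serve as the initial centroids and that squared Euclidean distance defines "nearest"; both choices, the function names, and the small data set are assumptions for illustration and are not taken from the thesis software.

```python
def nearest(centroids, x):
    """Index of the centroid closest to x (squared Euclidean distance)."""
    dists = [sum((a - b) ** 2 for a, b in zip(c, x)) for c in centroids]
    return dists.index(min(dists))

def one_pass_kmeans(data, k):
    """Assign data units serially, updating a centroid after each assignment."""
    centroids = [list(x) for x in data[:k]]   # assumed seeds: first k units
    counts = [1] * k
    labels = list(range(k))
    for x in data[k:]:
        c = nearest(centroids, x)
        counts[c] += 1
        # Running-mean update of the centroid that received the unit.
        centroids[c] = [m + (v - m) / counts[c]
                        for m, v in zip(centroids[c], x)]
        labels.append(c)
    return labels, centroids

# Hypothetical data units in two dimensions.
data = [[0.0, 0.0], [10.0, 10.0], [1.0, 0.0], [9.0, 10.0], [0.0, 1.0]]
labels, centroids = one_pass_kmeans(data, 2)
```

Because each centroid moves as soon as a unit joins its cluster, the result can depend on the order in which the data units are read.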

The decomposition of the total scatter matrix T into
within-groups and between-groups matrices, W and B, suggests
possible optimality criteria to be used in a clustering algorithm.
One would like the within-groups scatter to be small
relative to the between-groups scatter. Various trial
clusterings could be formed using the W and B matrices
as a basis for the optimality criteria which determine the
best clustering. A possible choice for a criterion is to
minimize trace W over all partitions into g groups.
Since T is constant over all partitions, minimizing trace
W is equivalent to maximizing trace B, since

$$\operatorname{trace} T = \operatorname{trace} W + \operatorname{trace} B \tag{41}$$

Although trace W is invariant under an orthogonal
transformation, it is not invariant under other nonsingular
linear transformations.
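
Equation (41) can be verified directly, since each trace is a sum of squared deviations. The sketch below is illustrative only; the function name `scatter_traces`, the data, and the group labels are hypothetical.

```python
def scatter_traces(data, labels):
    """Return (trace T, trace W, trace B) for a labeled data set."""
    n = len(data)
    grand = [sum(col) / n for col in zip(*data)]          # grand mean vector
    trace_t = sum((v - g) ** 2 for row in data for v, g in zip(row, grand))
    trace_w = 0.0
    trace_b = 0.0
    for group in set(labels):
        members = [row for row, lab in zip(data, labels) if lab == group]
        mean = [sum(col) / len(members) for col in zip(*members)]
        # Within-groups part: deviations of members from their group mean.
        trace_w += sum((v - m) ** 2 for row in members
                       for v, m in zip(row, mean))
        # Between-groups part: group size times deviation of the group mean.
        trace_b += len(members) * sum((m - g) ** 2
                                      for m, g in zip(mean, grand))
    return trace_t, trace_w, trace_b

# Hypothetical data: two well-separated groups of two observations each.
data = [[0.0, 0.0], [2.0, 0.0], [10.0, 4.0], [12.0, 4.0]]
labels = [0, 0, 1, 1]
trace_t, trace_w, trace_b = scatter_traces(data, labels)
```

For this partition trace W is small relative to trace B, which is the situation the minimum trace W criterion seeks.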

McRae (16) points out that trace W equals the total
within-group sum of squares; hence the "minimum variance
partition" cluster solution is found by minimizing trace W.

Considerable study has been devoted to alternative
criteria such as those based on multivariate statistical
analysis techniques, especially the methods of linear
discriminant analysis and multivariate analysis of variance.
Assuming the p variables are not linearly dependent, then
as long as $p \le N - g$, W is positive definite symmetric
and so is $W^{-1}$. Attempts to make B and W as different
as possible lead one to solving the determinantal equation

$$|B - \lambda W| = 0 \tag{42}$$

The solutions $\lambda_i$ are the eigenvalues of the matrix $W^{-1}B$,
as in discriminant analysis. There are t nonzero
eigenvalues, where t is the minimum of p and g-1.
This is a consequence of the fact that, if g is less than
p, the g group means lie in a (g-1)-dimensional
hyperplane. When g = 2 the analysis is equivalent to
two-group discriminant analysis. Linear discriminant
analysis would take the vectors originally described in a
p-dimensional coordinate system and transform the basis to a
t-dimensional system. Maximizing the largest of these
eigenvalues is a criterion suggested by S. N. Roy;
maximizing the trace of $W^{-1}B$ is a criterion
suggested by Hotelling. In both cases, large values for
these statistics are sought in clustering algorithms since
large values indicate large differences among (between)
groups. Minimizing the ratio of determinants $|W| / |T|$
is a criterion widely known as Wilks' lambda, discussed in
the discriminant analysis. Since T is the same for all
partitions, this criterion is equivalent to minimizing the
determinant of W. Both trace $W^{-1}B$ and $|T| / |W|$ may
be expressed in terms of the eigenvalues of $W^{-1}B$:

$$\frac{|T|}{|W|} = \prod_{i=1}^{t} (1 + \lambda_i) \tag{43}$$

$$\operatorname{trace} W^{-1}B = \sum_{i=1}^{t} \lambda_i \tag{44}$$

where t = min(p, g-1). Therefore minimizing det W is
equivalent to maximizing $\prod_{i=1}^{t} (1 + \lambda_i)$.
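
The step from the determinant ratio to the product of eigenvalues can be filled in as follows. This short derivation is a sketch that assumes the decomposition $T = W + B$ and a nonsingular $W$, as stated above; the symbols $\lambda_1, \ldots, \lambda_p$ denote all p eigenvalues of $W^{-1}B$, of which only the first t are nonzero.

```latex
\begin{aligned}
|T| &= |W + B| = |W|\,\bigl|I + W^{-1}B\bigr| \\
\frac{|T|}{|W|} &= \bigl|I + W^{-1}B\bigr|
   = \prod_{i=1}^{p} (1 + \lambda_i)
   = \prod_{i=1}^{t} (1 + \lambda_i),
   \qquad \text{since } \lambda_{t+1} = \cdots = \lambda_p = 0 \\
\operatorname{trace} W^{-1}B &= \sum_{i=1}^{p} \lambda_i = \sum_{i=1}^{t} \lambda_i
\end{aligned}
```

The second line uses the fact that the eigenvalues of $I + W^{-1}B$ are $1 + \lambda_i$ and that a determinant is the product of the eigenvalues; the third uses the fact that a trace is their sum.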

Friedman and Rubin (6) describe the advantages of the
various criteria. Those based on multivariate statistical
considerations (all but trace W) are invariant under
changes in scale for the variables (nonsingular linear
transformations). In fact, they are the only invariants for W
and B under such transformations. In addition, the
multivariate criteria may take into account covariation
among the variables.


V. ANALYSIS OF MULTIVARIATE UTILITY DATA


To illustrate hierarchical clustering we applied the
technique described in the previous chapter to partition a
set of twenty-six attributes of a close-air support weapon
system into a smaller collection of "superattributes". As
part of an effort to evaluate the military utility of a
proposed alternative U.S. Marine Corps air support radar
system, the AN-TPQ/27, Barr and Richards (4) extracted 26
attributes of the TPQ-27 and a baseline system, the AN-TPQ/10,
and then had members of the Operational Test and Evaluation
team assess the utility of the TPQ/27 relative to that of the
TPQ/10. In order that the additive model used to combine
unidimensional relative utilities into a system relative
utility be justifiable, it is necessary that the utilities
satisfy certain independence properties described in Keeney
and Raiffa (12).

Because those independence properties are very
difficult for decision makers to verify for complex alternatives
like the weapon systems under study, Professors Barr and
Richards attempted instead to work with the attributes to
try to generate a new collection which would likely satisfy,
at least approximately, the conditions required to justify
the additive model.


The original collection of 26 attributes is as follows:

 1. Portability
 2. Durability
 3. Time to Set Up
 4. Time to Take Down
 5. Ease of Assigning Aircraft to Targets
 6. Number of Aircraft Controlled
 7. Number of Targets
 8. Communications
 9. Mission Flexibility
10. ASRT Survivability
11. Time to Locate and Acquire Aircraft
12. Accuracy of Tracking
13. Accuracy of Delivery
14. Range
15. Aircraft Vulnerability
16. Aircraft Attack Throughput
17. Ease of Adjustment and Evaluation of Results
18. Accuracy of Feedback
19. Ease of Operation
20. Man-Machine Compatibility
21. Training Requirements
22. Reliability
23. Maintainability
24. Supportability
25. Availability
26. Documentation

It is clear from observing the above collection that
some of the attributes are highly correlated and redundant.
If one tries to assign an importance weight to each
attribute separately, there is a distinct likelihood that
some of the overlapping, strongly interrelated attributes
might effectively be double or triple weighted or more,
producing a biased result. In an effort to prevent this
from happening, Barr and Richards asked the utility
assessment team to partition the 26 attributes into a smaller
collection in such a way that attributes within a group are
similar and attributes in different groups are unrelated in the
sense that utility assessments for attributes in one group
do not depend on the amounts of attributes in any other
group.

The total number of groups was not prespecified.
Instead, each team member was allowed to partition the 26
attributes into any number of groups. The resulting
multivariate data array is shown in Table II. An element $x_{ij}$
is the number of the group into which team member j put
attribute i.

Table II. Data Matrix

Let us define a distance measure for this data array
as follows:

$$D(a_i, a_k) = \sum_{j} \bigl( 1 - I(x_{ij} - x_{kj}) \bigr) \tag{45}$$

where $a_i$ represents attribute i, the sum runs over the team
members j, and

$$I(z) = \begin{cases} 1 & \text{if } z = 0 \\ 0 & \text{otherwise} \end{cases}$$

It is easy to verify that D is a metric as defined in
Chapter IV. Since we will actually work with a similarity
measure in the hierarchical cluster procedure, we define
the similarity between two attributes $a_i$ and $a_k$ as

$$S(a_i, a_k) = \sum_{j} I(x_{ij} - x_{kj}) \tag{46}$$

One can see from this definition that the similarity between
two attributes $a_i$ and $a_k$ is simply the number of team
members who placed attributes $a_i$ and $a_k$ in the same
partition. For example, for one pair of attributes,

S = 0 + 1 + 0 + 1 + 0 + 0 + 0 + 1 + 0 + 0 + 0 + 1 + 0 = 4

Either S or D can be used in the computer program
shown in Appendix A for hierarchical clustering. One need
only indicate whether a correlation-like measure (larger
values imply more similar) or a distance-like
measure (smaller values imply more similar) is wanted. We
elected to use the former. The similarity matrix extracted
from the data is shown in Table III. We present only the lower
triangular elements since $S(a_i, a_i)$ equals the number of
team members for all i and the matrix is symmetric, i.e.,
$S(a_i, a_k) = S(a_k, a_i)$. Zero values are not written.

Table III. Similarity Matrix for Superattribute Determination

The results from the hierarchical clustering are shown
in Figure 4. The numbers printed along the left-hand
margin refer to the attribute numbers. As you proceed to
the right through the tree you will observe numbers greater
than 26. These correspond to the clusterings that take
place from one step to the next. For example, the number
27 shown at the juncture of 25 and 22 means that the first
attributes clustered together are 25 and 22 (this is
the most similar pair). This combination is then considered
as a new attribute which is later combined with the attribute
30 (itself a combination of 23 and 24) to form the attribute
31. This is later combined with attribute 2 to form
attribute 40, etc.

Figure 4. Tree for 26 Attributes

As discussed in Chapter IV, a decision has to be made as
to how many clusters (superattributes) are desired. All
hierarchical methods will continue clustering until there
is a single cluster. In order to decide on the number of
clusters (and their composition) one need only imagine drawing
a vertical line through the tree at various places. Each
intersection of the tree with the vertical line results in a
cluster. For example, the vertical line at the point A
results in the 6 clusters shown in Table IV.

The superattributes used in the utility study are those
shown in Table IV. A careful examination of the attributes
which compose the clusters shows that the results so obtained
are intuitively agreeable. The names supplied to the
superattributes are somewhat natural descriptions of the
clusters obtained.
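
The distance and similarity measures of equations (45) and (46) can be sketched in a few lines. This is an illustration only and is not the program of Appendix A: the function names are hypothetical, and the small array `x` (three attributes sorted by four team members) is invented data, with rows as attributes and columns as team members.

```python
def similarity(x, i, k):
    """Eq. (46): number of team members who put attributes i and k together."""
    return sum(1 for gi, gk in zip(x[i], x[k]) if gi == gk)

def distance(x, i, k):
    """Eq. (45): number of team members who separated attributes i and k."""
    return sum(1 for gi, gk in zip(x[i], x[k]) if gi != gk)

def lower_triangle(x):
    """Lower triangular similarity matrix (diagonal omitted)."""
    return [[similarity(x, i, k) for k in range(i)] for i in range(len(x))]

# Hypothetical group numbers: x[i][j] is the group into which
# team member j put attribute i.
x = [[1, 1, 2, 1],
     [1, 2, 2, 3],
     [2, 2, 1, 3]]
s_matrix = lower_triangle(x)
```

For every pair the two measures sum to the number of team members, so either one determines the other.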

The listing of the computer program and sample output
are given in Appendix A.


Table IV. Superattributes

Superattribute          Component Attributes
----------------------  ----------------------------------------
Facility of movement     1. Portability
                         3. Time to set up
                         4. Time to take down

Facility of use          5. Ease of assigning aircraft to targets
                         6. Number of aircraft controlled
                         7. Number of targets
                         9. Mission flexibility
                        11. Time to locate and acquire aircraft
                        16. Aircraft attack throughput
                        17. Ease of adjustment
                        18. Accuracy of feedback
                        19. Ease of operation
                        20. Man-machine compatibility

Precision               12. Accuracy of tracking
                        13. Accuracy of delivery
                        14. Range

Survivability           10. ASRT survivability
                        15. Aircraft vulnerability

Learning                21. Training requirements
                        26. Documentation

Readiness                2. Durability
                        22. Reliability
                        23. Maintainability
                        24. Supportability
                        25. Availability


VI. ANALYSIS OF ARMY TANK DATA


A. DATA STRUCTURE

In order to illustrate the nonhierarchical clustering
methodology, principal components analysis, and discriminant
analysis, data on Army tanks from eight different countries
were taken from Jane's Book of Weapon Systems (1979-80).
A total of twenty-four tanks were included in the data
array with observations on each of 10 variables. The 10
variables are listed below:

 1. Weight (ton)
 2. Length (meter)
 3. Width (meter)
 4. Height (meter)
 5. Road Speed (kilometer per hour)
 6. Trench Crossing (meter)
 7. Ground Pressure (kg/cm²)
 8. Maximum Armament (rounds)
 9. Ground Clearance (meter)
10. Power to Engine Ratio (BHP/ton)


The twenty-four tanks and the associated countries are
shown below:

Identification Number   Type/Name           Country
---------------------   -----------------   -----------
        11              T-62                U.S.S.R.
        12              T-54                U.S.S.R.
        13              T-10                U.S.S.R.
        14              ASU-85              U.S.S.R.
        15              MK-5/Chieftain      U.K.
        16              MK-3/Vickers        U.K.
        17              MK-13/Centurion     U.K.
        18              CVR(T)/Scorpion     U.K.
        19              XM-1                U.S.A.
        20              M60A2               U.S.A.
        21              M60                 U.S.A.
        22              M48                 U.S.A.
        23              M47                 U.S.A.
        24              PZ61                Switzerland
        25              PZ68                Switzerland
        26              STRV-103            Sweden
        27              Ikv-91              Sweden
        28              TYPE61              Japan
        29              TYPE74              Japan
        30              Leopard 2           W. Germany
        31              Leopard             W. Germany
        32              TAM                 W. Germany
        33              AMX 30              France
        34              AMX 13              France


We conjecture that a cluster analysis of the tank data will
result in clusters corresponding to nationality, since the
nations may place different emphasis on the variables in
the design of their tanks.


B. NONHIERARCHICAL CLUSTER ANALYSIS OF TANK DATA
1. The MIKCA Algorithm
The specific algorithm chosen for the nonhierarchical
cluster analysis of the tank data is the MIKCA (Multivariate
Iterative K-MEANS Clustering Algorithm) program written by
Douglas J. McRae as a part of his doctoral dissertation
at the University of North Carolina, Chapel Hill.

Reference to the flow chart in Figure 5 will aid the
reader in following the discussion of the algorithm. Inputs to
the program are the data matrix, an estimate for g (the
number of clusters), and the choice of criterion and distance
functions.

Figure 5. MIKCA Flow Chart

In the first step, preliminary calculations are made,
such as the variable means and standard deviations, as
well as the cross-product matrix T. The next step forms
the initial cluster centers. Then each of the other
observations is assigned to the nearest cluster. Euclidean
distance is used for this initial phase, and the cluster
centroids are recomputed after each observation is assigned
to a group. The observations are considered in the same
order as they were input. After all of them have been
assigned to clusters, the criterion value is computed.
This initial cluster-finding technique is referred to as a
one-pass K-MEANS procedure. It is performed three times,
and the solution which yields the best criterion value is
chosen as the initial cluster solution.

After the initial solution has been found, the program
advances to the iterative K-MEANS phase, where the observations
are again considered in the order in which they were input to
the program. It is in this phase that the user's choice of
distance function is used. The distance from each
observation to each cluster centroid is again computed, this time
with the user's distance function, the assignment to the
closest centroid being made and the centroid updated to
reflect its new membership. After considering all n
observations in this manner, the new criterion value is
checked for possible improvement during the K-MEANS iteration.
As long as the criterion value improves, the K-MEANS
procedure is repeated; if the criterion fails to improve, then
the MIKCA algorithm goes to the next step, the individual
switches section. Note the importance of the order of
consideration of the observations. The order is important
because the cluster means are recomputed after each
observation is reassigned.

In the individual switches phase, consideration is given
to moving each observation to every other cluster, the move
being made if and only if an improvement in the value of
the criterion results. An elaborate labelling procedure
provides a unique order in which to consider each observation.
This procedure continues until a complete pass through the
data is made with no changes in cluster membership.
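
The individual switches idea can be sketched as follows. This is not McRae's MIKCA code: it is a minimal illustration that assumes the minimum trace W criterion, visits the observations in input order rather than using MIKCA's labelling procedure, and refuses to empty a cluster. The function names, the tolerance, and the data are hypothetical.

```python
def trace_w(data, labels):
    """Total within-group sum of squares (trace W) for a partition."""
    total = 0.0
    for group in set(labels):
        members = [row for row, lab in zip(data, labels) if lab == group]
        mean = [sum(col) / len(members) for col in zip(*members)]
        total += sum((v - m) ** 2 for row in members
                     for v, m in zip(row, mean))
    return total

def individual_switches(data, labels, g):
    """Move single observations between clusters while trace W improves."""
    labels = list(labels)
    best = trace_w(data, labels)
    changed = True
    while changed:                       # stop after a pass with no moves
        changed = False
        for i in range(len(data)):
            for target in range(g):
                if target == labels[i]:
                    continue
                if labels.count(labels[i]) == 1:
                    continue             # assumption: never empty a cluster
                trial = labels[:i] + [target] + labels[i + 1:]
                value = trace_w(data, trial)
                if value < best - 1e-12:  # move only on strict improvement
                    labels, best, changed = trial, value, True
    return labels, best

# Hypothetical one-variable data with a poor starting partition.
data = [[0.0], [1.0], [2.0], [10.0], [11.0]]
final_labels, final_value = individual_switches(data, [0, 0, 1, 1, 1], 2)
```

Because a move is accepted only when the criterion strictly improves, the loop must terminate, although the partition reached may be a local rather than a global optimum.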

The MIKCA algorithm provides the following options for
distance and criterion functions.

Criterion

1. Minimum trace W
2. Minimum determinant W
3. Maximum largest root of |B - λW| = 0
4. Maximum sum of roots of |B - λW| = 0

Distance

1. Euclidean
2. Weighted Euclidean
3. Mahalanobis

A complete computer program is listed in Appendix B.


Z.. Cluster Results for Tank Data 
For clustering of the tank data we selected the 
minimum trace W criterion and the weighted Euclidean 
distance function. The algorithm automatically provides 
weights for the weighted Euclidean distance function. 
The results of the clustering with four clusters are shown 
in Table ¥. 

The conjecture of clustering by nationalities is 
supported by the results. The three Soviet tanks make up 
one cluster and the two British and four of the United 
States tanks were found to be similar. A third cluster 


consists of four tanks which are very lightweight. The 
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final cluster consists of the rest of the tanks, including 
tanks of United States allies from West Germany, France, 
Sweden, Switzerland and Japan. 

A natural question to ask after observing the results 
of a cluster analysis is what variables most strongly 
influence the clustering that was observed. A clue is 
provided by the composition of the cluster containing all of 
the lightweight tanks. This suggests that weight is an 
important distinguishing feature. This is examined in the
principal components analysis and the discriminant analysis
in the next two sections.


C. PRINCIPAL COMPONENTS ANALYSIS 

The Statistical Package for Social Sciences (SPSS) 
(14) subprogram FACTOR was used for the principal components 
analysis. It is designed for both factor analysis and
principal components analysis. The outputs are designed
to be self-explanatory. In this example, the first 5
components account for 90% of the variance and the remaining
components account for only 10% of the variance (Figure 6). 
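
The count of components needed to reach a given share of the variance can be read off the eigenvalues, as in the following Python sketch. The eigenvalues used in the usage example are hypothetical and are not the values in Figure 6; they are chosen only so that they sum to 10, as the eigenvalues of a correlation matrix of ten variables would.

```python
def components_needed(eigenvalues, target=0.90):
    """Smallest number of leading components whose eigenvalues
    account for at least `target` of the total variance."""
    total = sum(eigenvalues)
    running = 0.0
    ordered = sorted(eigenvalues, reverse=True)
    for count, value in enumerate(ordered, start=1):
        running += value
        # Small tolerance guards against floating-point round-off.
        if running / total >= target - 1e-12:
            return count
    return len(ordered)


# Hypothetical eigenvalues for ten standardized variables.
example = [4.0, 2.0, 1.5, 1.0, 0.5, 0.4, 0.3, 0.2, 0.06, 0.04]
needed = components_needed(example, target=0.90)
```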

The subprogram FACTOR provides a graphical presentation
(Figure 7) of the factors determined by the orthogonal
rotation (in this example, variance maximization
rotation). In reading the graphs, one should be attentive
to the following three features: (1) the relative distance of
a variable from the axis, (2) the direction of a variable
in relation to the axis, and finally (3) the clustering of
variables and their position relative to each other.
For example, variables 5 (road speed) and 10 (power to
engine ratio) contribute heavily to the first principal
component, while variables 1 (weight) and 3 (width)
contribute most strongly to the second principal component.
Variables 2, 4, 6, 7, 8, and 9 are not as important. The
weights accorded each variable in the 10 factors (principal
components) are shown in Figure 8. The complete SPSS
program is listed in Appendix C.


D. DISCRIMINANT ANALYSIS 

The SPSS subprogram DISCRIMINANT was used to determine 
that function or those functions of the 10 variables that 
best discriminate among the four clusters determined by the
cluster analysis above.

The maximum number of discriminant functions that can be
derived is either one less than the number of groups or equal
to the number of discriminating variables, whichever is
smaller. This subprogram provides two measures for judging
the importance of discriminant functions. One of these is the
relative percentage of the eigenvalue associated with the
function, which measures the relative importance of the
function. The sum of the eigenvalues is a measure of the total
variance existing in the discriminating variables. Since
discriminant functions are derived in order of their
importance, this process can be stopped whenever the relative
percentage is judged to be too small. Of course, there is no
fixed rule for deciding what is too small. In this research,
we selected, arbitrarily, a significance level of 0.10. The
output shown




in Figure 9 suggests that we therefore consider only the
first two discriminant functions.

The second measure for judging the importance of a
discriminant function is its associated canonical correlation. 
The canonical correlation is a measure of association between 
the single discriminant function and the set of (g-1) dummy 
variables which define the g group memberships. It tells 
us how closely the function and the group variable are 
related, which is just another measure of the function's 
ability to discriminate among the groups. From Figure 10, 
the first two discriminant functions are each highly corre- 
lated with the groups but the third has only a moderate 
correlation. 
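
The two measures discussed above can both be computed from the eigenvalues of the discriminant functions, as in the following Python sketch. It assumes the usual convention that the eigenvalues are those of W^-1 B, for which the canonical correlation of a function is sqrt(eigenvalue / (1 + eigenvalue)). The eigenvalues in the usage example are hypothetical, not those of Figure 9 or Figure 10.

```python
import math


def relative_percentages(eigenvalues):
    """Each eigenvalue as a percentage of the sum of all eigenvalues."""
    total = sum(eigenvalues)
    return [100.0 * value / total for value in eigenvalues]


def canonical_correlation(eigenvalue):
    """Canonical correlation of one discriminant function, assuming
    the eigenvalue comes from the matrix W^-1 B."""
    return math.sqrt(eigenvalue / (1.0 + eigenvalue))


# Hypothetical eigenvalues for three discriminant functions.
percentages = relative_percentages([6.0, 3.0, 1.0])
correlations = [canonical_correlation(v) for v in [6.0, 3.0, 1.0]]
```

Under the same convention, a function with a large relative percentage also has a canonical correlation close to one, so the two measures usually rank the functions in the same order.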

The next criterion for eliminating discriminant
functions is to test for the statistical significance of
discriminating information not already accounted for by
the earlier functions. As each function is derived,
starting with no (zero) functions, Wilks' lambda is computed.
Lambda is an inverse measure of the discriminating power in
the original variables which has not yet been removed by the
discriminant functions: the larger lambda is, the less
information remains. Lambda can be transformed into
a chi-square statistic for an easy test of statistical
significance. In Figure 9, Wilks' lambda was .594 after
the first two functions had been derived. This corresponds
to a chi-square of 8.8476 with a probability level of .1823.
This means that a lambda of this magnitude or smaller has a
.1823 probability of occurring by chance in sampling even if
there were no further information to be accounted for by a
third function in the population. Since .1823 exceeds the
0.10 significance level selected above, a third function is
clearly not statistically significant in this case.
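
The transformation from lambda to chi-square is not spelled out in the text. One common choice is Bartlett's approximation, chi-square = -(n - 1 - (p + g)/2) ln(lambda), with (p - k)(g - k - 1) degrees of freedom after k functions have been derived. The Python sketch below applies it under stated assumptions: n = 24 observations and g = 4 groups come from the tank data, while p = 8 is an assumed number of variables entering the functions, chosen only because it gives values close to those reported; the text does not state it. The function names are hypothetical.

```python
import math


def bartlett_chi_square(wilks_lambda, n, p, g):
    """Bartlett's chi-square approximation for Wilks' lambda."""
    return -(n - 1 - (p + g) / 2.0) * math.log(wilks_lambda)


def residual_df(p, g, k):
    """Degrees of freedom after k functions have been derived."""
    return (p - k) * (g - k - 1)


def chi_square_tail_even_df(x, df):
    """Upper-tail probability of a chi-square variate.
    Closed form, valid only for even degrees of freedom."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form needs a positive even df")
    half = x / 2.0
    term, total = 1.0, 1.0
    for k in range(1, df // 2):
        term *= half / k
        total += term
    return math.exp(-half) * total


# Assumed inputs: n and g from the tank data, p = 8 is an assumption.
chi2 = bartlett_chi_square(0.594, n=24, p=8, g=4)
df = residual_df(p=8, g=4, k=2)
prob = chi_square_tail_even_df(8.8476, df)
```

Under these assumptions the approximation gives a chi-square of roughly 8.85 on 6 degrees of freedom, and the tail probability of the reported chi-square of 8.8476 agrees with the reported .1823 to three decimals.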

The standardized discriminant function coefficients,
corresponding to the values of the v_ij's discussed in the
previous section, are used to compute the discriminant score
for a case (observation) in which the original discriminating 
variables are in standard form. The discriminant score is 
computed by multiplying each discriminating variable by its 
corresponding coefficient and adding together these products. 
There is a separate score for each observation on each 
function. The coefficients have been derived in such a way 
that the discriminant scores produced are in standard form. 
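
The score computation just described amounts to a weighted sum, as the following Python sketch shows. The coefficients and the standardized values in the usage example are hypothetical and are not taken from the SPSS output.

```python
def discriminant_score(standardized_values, coefficients):
    """Discriminant score of one observation on one function: each
    standardized variable times its coefficient, summed."""
    if len(standardized_values) != len(coefficients):
        raise ValueError("need one coefficient per variable")
    return sum(z * c for z, c in zip(standardized_values, coefficients))


# Hypothetical observation with three standardized variables.
score = discriminant_score([1.0, -2.0, 0.5], [0.5, 0.25, 2.0])
```

Calling the function once per discriminant function, with that function's coefficients, gives the separate score for each observation on each function.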

When the sign is ignored, each standardized discriminant
function coefficient represents the relative contribution of
its associated variable to that function. The sign merely
denotes whether the variable is making a positive or negative
contribution. 

A graphical presentation is shown in Figure 11 using
the first and the second canonical discriminant functions as
the axes. From this scatterplot, we can easily see that
Soviet tanks (labelled 1) are well distinguished from
all of the others using only the first two discriminant
functions. Also, all the lightweight tanks are clearly
separated from the others. Groups 2 and 3 are also distinct,
though they are not separated from each other as much as
they are from groups 1 and 4. The complete SPSS program
for the discriminant analysis is listed in Appendix D.

VII. CONCLUSION


The multivariate analysis techniques of cluster analysis, 
principal components analysis and discriminant analysis are 
useful in real world problems for examining observations
on each of several dimensions. Each of the techniques is
related mathematically to the others, and each complements
the others in explaining the data.

Computer software is readily available from many sources.
The software used in this thesis for hierarchical clustering,
principal components analysis, and discriminant analysis was
from the IMSL package and SPSS. For nonhierarchical
clustering, we used the FORTRAN program developed by McRae
(16). All of this software is readily available and
documented at the Naval Postgraduate School.